Search Result

Select

Detection of negative emotion burst topic in microblog text stream

LI Yanhong, ZHAO Hongwei, WANG Suge, LI Deyu

Journal of Computer Applications 2020, 40 (12): 3458-3464. DOI: 10.11772/j.issn.1001-9081.2020060880

Abstract （305）

PDF （1188KB）（400）

Save

How to find negative emotion burst topic in time from massive and noisy microblog text stream is essential for emergency response and handling of emergencies. However, the traditional burst topic detection methods often ignore the differences between negative emotion burst topic and non-negative emotion burst topic. Therefore, a Negative Emotion Burst Topic Detection (NE-BTD) algorithm for microblog text stream was proposed. Firstly, the accelerations of keyword pairs in microblog and the change rate of negative emotion intensity were used as the basis for judging the topics of negative emotion. Secondly, the speeds of burst word pairs were used to determine the window range of negative emotion burst topics. Finally, a Gibbs Sampling Dirichlet Multinomial Mixture model (GSDMM) clustering algorithm was used to obtain the topic structures of the negative emotion burst topics in the window. In the experiments, the proposed NE-BTD algorithm was compared with an existing Emotion-Based Method of Topic Detection (EBM-TD) algorithm. The results show that the NE-BTD algorithm was at least 20% higher in accuracy and recall than the EBM-TD algorithm, and it can detect negative emotion burst topic at least 40 minutes earlier.

Reference | Related Articles | Metrics

Select

Fast unsupervised feature selection algorithm based on rough set theory

BAI Hexiang, WANG Jian, LI Deyu, CHEN Qian

Journal of Computer Applications 2015, 35 (8): 2355-2359. DOI: 10.11772/j.issn.1001-9081.2015.08.2355

Abstract （602）

PDF （773KB）（349）

Save

Focusing on the issue that feature selection for the usually encountered large scale data sets in the "big data" is too slow to meet the practical requirements, a fast feature selection algorithm for unsupervised massive data sets was proposed based on the incremental absolute reduction algorithm in traditional rough set theory. Firstly, the large scale data set was regarded as a random object sequence and the candidate reduct was set empty. Secondly, random object was one by one drawn from the large scale data set without replacement; next, each random drawn object was checked if it could be distinguished with the other objects in the current object set and then merged with current object set, if the new object could not be distinguished using the candidate reduct, a new attribute that can distinguish the new object should be added into the candidate reduct. Finally, if successive I objects were distinguishable using the candidate reduct, the candidate reduct was used as the reduct of the large scale data set. Experiments on five unsupervised large-scale data sets demonstrated that a reduct which can distinguish no less than 95% object pairs could be found within 1% time needed by the discernibility matrix based algorithm and incremental absolute reduction algorithm. In the experiment of the text topic mining, the topic found by the reducted data set was consistent with that of the original data set. The experimental results show that the proposed algorithm can obtain effective reducts for large scale data set in practical time.

Reference | Related Articles | Metrics

Select

Kernel improvement of multi-label feature extraction method

LI Hua, LI Deyu, WANG Suge, ZHANG Jing

Journal of Computer Applications 2015, 35 (7): 1939-1944. DOI: 10.11772/j.issn.1001-9081.2015.07.1939

Abstract （519）

PDF （997KB）（495）

Save

Focusing on the issue that the label kernel functions do not take the correlation between labels into consideration in the multi-label feature extraction method, two construction methods of new label kernel functions were proposed. In the first method, the multi-label data were transformed into single-label data, and thus the correlation between labels could be characterized by the label set; then a new label kernel function was defined from the perspective of loss function of single-label data. In the second method, mutual information was used to characterize the correlation between labels, and a new label kernel function was proposed from the perspective of mutual information. Experiments on three real-life data sets using two multi-label classifiers demonstrated that the best method of all measures was feature extraction method with label kernel function based on loss function and the performance of five evaluation measures on average increased by 10%; especially on the data set Yeast, the evaluation measure Coverage reached a decline of about 30%. Closely followed by feature extraction method with label kernel function based on mutual information and the performance of five evaluation measures on average increased by 5%. The theoretical analysis and simulation results show that the feature extraction methods based on new output kernel functions can effectively extract features, simplify learning process of multi-label classifiers and, moreover, improve the performance of multi-label classification.

Reference | Related Articles | Metrics

Select

Real-time detection framework for network intrusion based on data stream

LI Yanhong, LI Deyu, CUI Mengtian, LI Hua

Journal of Computer Applications 2015, 35 (2): 416-419. DOI: 10.11772/j.issn.1001-9081.2015.02.0416

Abstract （546）

PDF （792KB）（424）

Save

The access request for computer network has the characteristics of real-time and dynamic change. In order to detect network intrusion in real time and be adapted to the dynamic change of network access data, a real-time detection framework for network intrusion was proposed based on data stream. First of all, misuse detection model and anomaly detection model were combined. A knowledge base was established by the initial clustering which was made up of normal patterns and abnormal patterns. Secondly, the similarity between network access data and normal pattern and abnormal pattern was measured using the dissimilarity between data point and data cluster, and the legitimacy of network access data was determined. Finally, when network access data stream evolved, the knowledge base was updated by reclustering to reflect the state of network access. Experiments on intrusion detection dataset KDDCup99 show that, when initial clustering samples are 10000, clustering samples in buffer are 10000, adjustment coefficient is 0.9, the proposed framework achieves a recall rate of 91.92% and a false positive rate of 0.58%. It approaches the result of the traditional non-real-time detection model, but the whole process of learning and detection only scans network access data once. With the introduction of knowledge base update mechanism, the proposed framework is more advantageous in the real-time performance and adaptability of intrusion detection.

Reference | Related Articles | Metrics